How should I price auto insurance in the United States?

Introduction

Business Context. The ability to price an insurance quote properly has a significant impact on insurers' management decisions and financial statements. You are the chief data scientist at a startup insurance company focused on providing affordable insurance to millennials. You are tasked with assessing the current state of the market to determine which factors large insurance providers use when setting premiums. Fortunately, your company has compiled a dataset by surveying what people currently pay for insurance from large companies. Your findings will form the basis of your company's millennial car insurance offering.

Business Problem. Your task is to build a minimal model to predict the cost of insurance from the data set using various characteristics of a policyholder.

Analytical Context. The data resides in a CSV file which has been pre-cleaned for you and can directly be read in. Throughout the case, you will be iterating on your initial model many times based on common pitfalls that arise which we discussed in previous cases. You will be using the Python statsmodels package to create and analyze these linear models.

In [51]:
### Load relevant packages

import pandas                  as pd
import numpy                   as np
import matplotlib.pyplot       as plt
import seaborn                 as sns
import statsmodels.api         as sm
import statsmodels.formula.api as smf
import os
import string
import re
import sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
## For a map
import branca
import geopandas
import scipy
from scipy import stats

import folium # package for making maps, please make sure to use a version older than 1.0.0.
from IPython.display import display
from folium.plugins import TimeSliderChoropleth

# This statement allows plots to display inline without calling plt.show()
%matplotlib inline

# always make it pretty 
sns.set_style('whitegrid')
plt.style.use('seaborn-whitegrid')

Diving into the data

In [52]:
df = pd.read_csv('Allstate-cost-cleaned.csv', index_col=0,
    dtype = { # indicate categorical variables
        'A': 'category',
        'B': 'category',
        'C': 'category',
        'D': 'category',
        'E': 'category',
        'F': 'category',
        'G': 'category',
        'car_value': 'category',
        'state': 'category'
    }
)

The following are the columns in the dataset:

  1. state: State where the shopping point occurred
  2. group_size: How many people will be covered under the policy (1, 2, 3, or 4)
  3. homeowner: Whether the customer owns a home (0 = no, 1 = yes)
  4. car_age: Age of the customer's car, in years
  5. car_value: Value of the car when it was new
  6. risk_factor: An ordinal assessment of how risky the customer is (0, 1, 2, 3, 4)
  7. age_oldest: Age of the oldest person in the customer's group
  8. age_youngest: Age of the youngest person in the customer's group
  9. married_couple: Whether the customer group contains a married couple (0 = no, 1 = yes)
  10. C_previous: What the customer formerly had or currently has for product option C (0 = nothing, 1, 2, 3, 4)
  11. duration_previous: How long (in years) the customer was covered by their previous issuer
  12. A-G: The coverage options:
      - A: Collision (levels: 0, 1, 2)
      - B: Towing (levels: 0, 1)
      - C: Bodily Injury (BI, levels: 1, 2, 3, 4)
      - D: Property Damage (PD, levels: 1, 2, 3)
      - E: Rental Reimbursement (RR, levels: 0, 1)
      - F: Comprehensive (Comp, levels: 0, 1, 2, 3)
      - G: Medical/Personal Injury Protection (Med/PIP, levels: 1, 2, 3, 4)
  13. cost: Cost of the quoted coverage options
In [53]:
print(df.shape)
df.head(10)
(15483, 19)
Out[53]:
state group_size homeowner car_age car_value risk_factor age_oldest age_youngest married_couple C_previous duration_previous A B C D E F G cost
0 OK 1 0 9 f 0.0 24 24 0 3.0 9.0 0 0 1 1 0 0 4 543
1 OK 1 0 9 f 0.0 24 24 0 3.0 9.0 2 1 1 3 1 3 2 611
2 PA 1 1 7 f 0.0 74 74 0 2.0 15.0 2 0 2 3 1 2 2 691
3 PA 1 1 7 f 0.0 74 74 0 2.0 15.0 2 0 2 3 1 2 2 695
4 AR 1 0 4 d 4.0 26 26 0 3.0 1.0 1 0 1 1 0 2 2 628
5 AR 1 0 4 d 4.0 26 26 0 3.0 1.0 1 0 2 1 0 2 2 625
6 AR 1 0 4 d 4.0 26 26 0 3.0 1.0 1 0 2 1 0 2 2 628
7 OK 1 0 13 f 3.0 22 22 0 0.0 0.0 0 0 1 1 0 0 2 596
8 OK 1 0 13 f 3.0 22 22 0 0.0 0.0 2 0 1 1 0 3 2 711
9 OK 1 0 13 f 3.0 22 22 0 0.0 0.0 2 0 1 1 0 3 2 722

Exercise 1 :

Write code to visualize the relationship between cost and the following variables. Choose your plots judiciously based on what you know about each variable: different variable types (categorical vs. numerical) call for different plot types (e.g. scatter plot, box plot, violin plot). Group your plots together using the plt.subplot() function.

  1. car_age
  2. age_oldest
  3. age_youngest
  4. duration_previous
  5. C_previous
  6. homeowner
  7. group_size
  8. car_value
  9. Categories A-G (7 different plots)

Answer.


The exploratory graphs for the different variables are shown below. The ordinal categorical variables, which are treated as continuous variables in the regression, are plotted as violin plots. Continuous variables are plotted as scatter charts with a LOWESS trend line added.

In [54]:
fig = plt.figure(figsize=(12,10))
vars_to_look = ['car_age', 'age_oldest','age_youngest', 'duration_previous', 'C_previous', 'homeowner', 'group_size',
               'car_value']

for i, var in enumerate(vars_to_look):
    plt.subplot(3,3,i+1)
    if var in ['homeowner', 'C_previous', 'group_size', 'car_value']:
        sns.violinplot(x=df[var], y = df['cost'], palette='viridis', 
                    boxprops={'alpha':0.5}, flierprops={'marker':'o'})
        plt.title("Violinplot of " + var) 
    else:
        sns.regplot(x=df[var], y = df['cost'], lowess=True, 
                    line_kws = {'color': "red", 'ls':'--'},
                    scatter_kws = {"s": 10, 'alpha':0.5, 'color':'blue'} )
        plt.title("Scatterplot of " + var) 

fig.set_tight_layout(True)
In [55]:
fig = plt.figure(figsize=(14,6))
vars_to_look = [i for i in string.ascii_uppercase[:7]]

for i, var in enumerate(vars_to_look):
    plt.subplot(2,4,i+1)
    sns.violinplot(x=df[var], y = df['cost'], palette='viridis', 
                    boxprops={'alpha':0.5}, flierprops={'marker':'o'})
    plt.title("Violinplot of coverage option: " + var)

fig.set_tight_layout(True)

A negative relationship is observed between cost and each of car_age, age_oldest, age_youngest, and duration_previous. No notable relationships were observed with the remaining variables.
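These visual impressions can be quantified with simple correlation coefficients. A minimal sketch, using a small toy stand-in built from a few rows of the head shown above (in the notebook you would call corrwith on df directly):

```python
import pandas as pd

# Toy stand-in for the real df, so this sketch is self-contained
toy = pd.DataFrame({
    'car_age':      [9, 7, 4, 13, 2],
    'age_youngest': [24, 74, 26, 22, 70],
    'cost':         [543, 691, 628, 596, 643],
})

# Pearson correlation of each numeric predictor with cost
corr = toy[['car_age', 'age_youngest']].corrwith(toy['cost'])
print(corr.round(2))
```

On the full dataset, negative correlations would corroborate the downward trends seen in the scatter plots.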

Exercise 2 :

Convert all categorical data to be in the one-hot encoding format.

Answer.

In [56]:
cat_variables = df.select_dtypes(exclude=['number', 'bool']).columns
df_cat = pd.get_dummies(df, columns=cat_variables, drop_first=True)
df_cat
Out[56]:
group_size homeowner car_age risk_factor age_oldest age_youngest married_couple C_previous duration_previous cost ... C_4 D_2 D_3 E_1 F_1 F_2 F_3 G_2 G_3 G_4
0 1 0 9 0.0 24 24 0 3.0 9.0 543 ... 0 0 0 0 0 0 0 0 0 1
1 1 0 9 0.0 24 24 0 3.0 9.0 611 ... 0 0 1 1 0 0 1 1 0 0
2 1 1 7 0.0 74 74 0 2.0 15.0 691 ... 0 0 1 1 0 1 0 1 0 0
3 1 1 7 0.0 74 74 0 2.0 15.0 695 ... 0 0 1 1 0 1 0 1 0 0
4 1 0 4 4.0 26 26 0 3.0 1.0 628 ... 0 0 0 0 0 1 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15478 1 0 2 1.0 70 70 0 4.0 9.0 643 ... 1 0 1 1 0 1 0 0 1 0
15479 1 0 2 1.0 70 70 0 4.0 9.0 643 ... 1 0 1 1 0 1 0 0 1 0
15480 1 0 2 1.0 70 70 0 4.0 9.0 647 ... 1 0 1 1 0 1 0 0 1 0
15481 1 1 0 3.0 25 25 0 2.0 6.0 642 ... 0 0 1 0 0 0 0 1 0 0
15482 1 1 0 3.0 25 25 0 2.0 6.0 646 ... 0 0 1 0 0 0 0 1 0 0

15483 rows × 68 columns
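pd.get_dummies is the quickest route here. The same encoding can also be produced with scikit-learn's OneHotEncoder (imported at the top), which is preferable when the encoder must be fit on training data and then reused on unseen data. A minimal sketch on a toy frame; the manual column naming avoids version-specific helper methods:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'state': ['OK', 'PA', 'AR', 'OK']})

# drop='first' mirrors drop_first=True in pd.get_dummies
enc = OneHotEncoder(drop='first')
codes = enc.fit_transform(toy[['state']]).toarray()

# Categories are sorted alphabetically; the first ('AR') is dropped
cols = ['state_' + c for c in enc.categories_[0][1:]]
encoded = pd.DataFrame(codes, columns=cols)
print(encoded)
```

The fitted encoder can later be applied to new data with enc.transform, guaranteeing the same dummy columns in the same order.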

Fitting a multiple linear regression

Exercise 3 :

Split your data into training and testing sets (an 80-20 split is a good starting point).

Note: Keep random seed as 1337 in the code cell

Answer.

In [57]:
# 80-20 train-test split with a fixed seed for reproducibility
train, test = train_test_split(df_cat, test_size=0.2, random_state=1337)
In [58]:
[print(i,'shape \t->', eval(i).shape) for i in ['train', 'test']];
train shape 	-> (12386, 68)
test shape 	-> (3097, 68)

Exercise 4 :

4.1

Fit a multiple linear regression model to the training data regressing cost against all the other variables. Call this model_all. What is the AIC value?

Answer.

In [59]:
ls = list(train.columns)      # List all column names
ls.remove('cost')             # Remove 'cost' variable

model_all = smf.ols(formula = 'cost ~ ' + ' + '.join(ls), 
                   data = train).fit()
model_all.summary()
Out[59]:
OLS Regression Results
Dep. Variable: cost R-squared: 0.436
Model: OLS Adj. R-squared: 0.432
Method: Least Squares F-statistic: 141.8
Date: Tue, 06 Oct 2020 Prob (F-statistic): 0.00
Time: 13:40:17 Log-Likelihood: -61793.
No. Observations: 12386 AIC: 1.237e+05
Df Residuals: 12318 BIC: 1.242e+05
Df Model: 67
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 678.9815 5.606 121.114 0.000 667.993 689.970
group_size 2.8220 1.480 1.907 0.057 -0.079 5.723
homeowner -14.1096 0.732 -19.287 0.000 -15.544 -12.676
car_age -0.7839 0.067 -11.637 0.000 -0.916 -0.652
risk_factor -0.7263 0.229 -3.169 0.002 -1.175 -0.277
age_oldest 0.5646 0.064 8.830 0.000 0.439 0.690
age_youngest -0.9566 0.063 -15.303 0.000 -1.079 -0.834
married_couple -9.7020 1.397 -6.943 0.000 -12.441 -6.963
C_previous -6.1607 0.360 -17.136 0.000 -6.865 -5.456
duration_previous -1.4821 0.073 -20.210 0.000 -1.626 -1.338
state_AR 0.9856 3.128 0.315 0.753 -5.145 7.116
state_CO -9.3027 2.564 -3.629 0.000 -14.328 -4.278
state_CT 30.6000 2.901 10.546 0.000 24.913 36.287
state_DC 41.2959 5.099 8.099 0.000 31.301 51.290
state_DE 39.8354 4.605 8.650 0.000 30.808 48.863
state_FL 12.3899 2.160 5.737 0.000 8.157 16.623
state_GA 9.0600 2.381 3.805 0.000 4.392 13.728
state_IA -49.6236 3.487 -14.229 0.000 -56.460 -42.788
state_ID -17.6710 4.150 -4.258 0.000 -25.806 -9.536
state_IN -11.1517 2.566 -4.346 0.000 -16.181 -6.122
state_KS -7.3402 4.374 -1.678 0.093 -15.914 1.234
state_KY 18.9370 2.822 6.711 0.000 13.406 24.468
state_MD 23.0497 2.483 9.283 0.000 18.183 27.917
state_ME -34.4240 4.127 -8.341 0.000 -42.513 -26.335
state_MO -21.7184 2.932 -7.408 0.000 -27.465 -15.972
state_MS -3.2256 3.279 -0.984 0.325 -9.654 3.202
state_MT -8.0685 5.775 -1.397 0.162 -19.389 3.252
state_ND 19.8075 6.140 3.226 0.001 7.773 31.842
state_NE -14.0680 5.295 -2.657 0.008 -24.447 -3.689
state_NH -19.6489 3.728 -5.271 0.000 -26.956 -12.342
state_NM -1.8951 3.715 -0.510 0.610 -9.178 5.388
state_NV 22.8554 2.792 8.187 0.000 17.384 28.327
state_NY 38.4059 2.441 15.734 0.000 33.621 43.190
state_OH -7.3100 2.313 -3.161 0.002 -11.843 -2.777
state_OK -10.9053 2.707 -4.029 0.000 -16.212 -5.599
state_OR -4.9467 2.809 -1.761 0.078 -10.453 0.559
state_PA 9.3950 2.198 4.273 0.000 5.086 13.704
state_RI 25.6570 4.233 6.061 0.000 17.359 33.955
state_SD -17.2127 14.686 -1.172 0.241 -46.001 11.575
state_TN -10.9602 2.606 -4.206 0.000 -16.068 -5.853
state_UT -17.9898 2.820 -6.379 0.000 -23.518 -12.462
state_WA 3.8396 2.597 1.479 0.139 -1.250 8.930
state_WI -32.7110 2.969 -11.019 0.000 -38.530 -26.892
state_WV 25.1020 4.382 5.729 0.000 16.513 33.691
state_WY -6.1411 6.487 -0.947 0.344 -18.856 6.574
car_value_b -61.2799 8.237 -7.440 0.000 -77.426 -45.134
car_value_c -47.1041 5.055 -9.318 0.000 -57.013 -37.196
car_value_d -42.1166 4.835 -8.710 0.000 -51.595 -32.638
car_value_e -42.3329 4.810 -8.800 0.000 -51.762 -32.904
car_value_f -42.1720 4.828 -8.735 0.000 -51.635 -32.709
car_value_g -37.9286 4.869 -7.789 0.000 -47.474 -28.384
car_value_h -29.1962 5.064 -5.765 0.000 -39.123 -19.270
car_value_i -9.3399 6.467 -1.444 0.149 -22.016 3.336
A_1 27.6786 1.577 17.555 0.000 24.588 30.769
A_2 33.2434 1.866 17.812 0.000 29.585 36.902
B_1 2.0702 0.790 2.619 0.009 0.521 3.620
C_2 0.6876 1.122 0.613 0.540 -1.512 2.887
C_3 1.8784 1.165 1.612 0.107 -0.406 4.163
C_4 4.4500 1.747 2.548 0.011 1.026 7.874
D_2 -1.5277 1.200 -1.273 0.203 -3.881 0.825
D_3 -2.2757 1.250 -1.821 0.069 -4.726 0.174
E_1 8.0006 0.870 9.194 0.000 6.295 9.706
F_1 17.1115 1.716 9.971 0.000 13.748 20.475
F_2 16.3817 1.651 9.920 0.000 13.145 19.619
F_3 10.6372 2.357 4.514 0.000 6.018 15.257
G_2 9.1544 0.937 9.773 0.000 7.318 10.991
G_3 0.9010 1.188 0.758 0.448 -1.428 3.230
G_4 4.2100 1.267 3.323 0.001 1.727 6.693
Omnibus: 572.878 Durbin-Watson: 1.998
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1407.689
Skew: 0.266 Prob(JB): 2.11e-306
Kurtosis: 4.564 Cond. No. 3.17e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.17e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [60]:
print("The AIC for model_all is {:.2f}".format(model_all.aic));
The AIC for model_all is 123722.41
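As a sanity check, the reported AIC can be reproduced from the summary's log-likelihood, since statsmodels computes AIC = 2k - 2·ln(L), where k is the number of estimated parameters (here Df Model = 67 terms plus the intercept). A quick sketch using the (rounded) values printed in the summary above:

```python
# Values taken from the model_all summary above
log_likelihood = -61793.2  # Log-Likelihood (printed rounded as -61793.)
k = 67 + 1                 # Df Model + intercept

aic = 2 * k - 2 * log_likelihood
print(round(aic, 2))
```

This recovers the reported AIC of roughly 123722, up to rounding of the printed log-likelihood.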

4.2

According to model_all, which states are most and least expensive?

Answer.

In [61]:
# Select all parameters from the OLS model
dummy = model_all.params.reset_index(name='Estimate')
dummy = dummy[dummy['index'].map(lambda x: 'state' in x)].sort_values('Estimate', ascending=False) # Select state
# Summarise findings
print('-'*40); print('Most Expensive States'); print('-'*40)
print(dummy.head()); 
print('-'*40); print('Least Expensive States'); print('-'*40)
print(dummy.tail())
----------------------------------------
Most Expensive States
----------------------------------------
       index   Estimate
13  state_DC  41.295934
14  state_DE  39.835392
32  state_NY  38.405908
12  state_CT  30.600011
37  state_RI  25.657045
----------------------------------------
Least Expensive States
----------------------------------------
       index   Estimate
29  state_NH -19.648902
24  state_MO -21.718362
42  state_WI -32.711031
23  state_ME -34.424033
17  state_IA -49.623598
In [62]:
# Strip the 'state_' prefix to get the state abbreviation
dummy['abbr'] = dummy['index'].apply(lambda x: re.sub(r'state_','', x))
# Open downloaded georeference and region-tagging data for the states
US_states = geopandas.read_file("./gz_2010_us_040_00_5m.json", driver="GeoJSON")
state_regions = pd.read_csv("./state_regions.csv")
# Transforming geopandas and merging data
US_states1 = US_states\
.merge(state_regions, how='left', left_on='NAME', right_on='State')\
.merge(dummy.drop(columns='index'), how='left', left_on='State Code', right_on='abbr')
# Fill NaN values - there are some states not contained in the dataset
US_states1['Estimate'] = US_states1['Estimate'].fillna(0)
In [63]:
min_cn, max_cn = US_states1['Estimate'].quantile([0.01,0.99]).apply(round, 2)
colormap = branca.colormap.LinearColormap(colors=['darkred','red','white','blue','darkblue'], vmin=min_cn, vmax=max_cn )
colormap.caption = "Regression estimates of insurance cost"

USmap = folium.Map(location=[48-10, -102], zoom_start=4.45, tiles="OpenStreetMap")

style_function = lambda x: {
    'fillColor': colormap(x['properties']['Estimate']),
    'color': 'black',
    'weight':1,
    'fillOpacity':0.5
}

stategeo = folium.GeoJson(
            US_states1.to_json(), name='USmap', style_function=style_function,
            tooltip=folium.GeoJsonTooltip(
                fields=['State', 'Region', 'Estimate'],
                aliases=['State', 'Region', 'Coefficient'], 
                localize=True)).add_to(USmap)
colormap.add_to(USmap)
USmap
Out[63]:
(Interactive choropleth map of the state coefficient estimates; it renders when the notebook is trusted.)

The states with the highest insurance costs are the District of Columbia and Delaware, while the states with the lowest insurance costs are Iowa and Maine. States in the West region appear to have lower insurance costs than those in the Northeast.

4.3

Interpret the coefficients of group_size, homeowner, car_age, risk_factor, age_oldest, age_youngest married_couple , duration_previous. Do the signs and values of these coefficients make sense to you in the context of this business problem?

Answer.

In [64]:
dummy = model_all.params.reset_index(name='Estimate')
vars_to_look = ['group_size', 'homeowner', 'car_age', 'risk_factor', 'age_oldest', 'age_youngest', 
                'married_couple', 'duration_previous']
# vars_to_look
dummy[dummy['index'].map(lambda x: x in vars_to_look)]
Out[64]:
index Estimate
1 group_size 2.821981
2 homeowner -14.109587
3 car_age -0.783868
4 risk_factor -0.726283
5 age_oldest 0.564635
6 age_youngest -0.956600
7 married_couple -9.702002
9 duration_previous -1.482135
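Each coefficient above is a marginal effect: the expected change in cost for a one-unit change in that variable, holding everything else fixed. For example, using the estimates from the table, the predicted cost difference between a homeowner with 5 extra years of prior coverage and an otherwise-identical customer is:

```python
# Coefficient estimates copied from the model_all summary above
coef = {'homeowner': -14.109587, 'duration_previous': -1.482135}

# Homeowner (+1 unit) plus 5 additional years of prior coverage
delta = coef['homeowner'] * 1 + coef['duration_previous'] * 5
print(round(delta, 2))  # roughly $21.5 cheaper
```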

The coefficient of group_size (+2.82) indicates that each additional person covered adds about $2.82 to the quoted cost, which makes sense since more people means more exposure. homeowner (-14.11) and married_couple (-9.70) lower the cost, consistent with insurers treating homeowners and married couples as more stable, lower-risk customers. car_age (-0.78) lowers the cost by roughly $0.78 per year of car age, which is sensible since older cars are worth less. duration_previous (-1.48) rewards customers with a longer prior coverage history. age_oldest (+0.56) and age_youngest (-0.96) pull in opposite directions: holding the oldest age fixed, a younger youngest member raises the cost, consistent with young drivers being riskier. The sign of risk_factor (-0.73) is counterintuitive, since we would expect riskier customers to pay more, so this coefficient warrants further investigation.

Exercise 5 :

Which variables from model_all are statistically significant? (For categorical variables, consider them to be significant if at least one of their categories are statistically significant). Refit the model using only these variables; call this model_sig. How does this model compare to the previous model?

Answer.

In [65]:
# dummy = model_all.params.reset_index(name='Estimate')
dummy = pd.concat([model_all.params, model_all.bse, model_all.tvalues, model_all.conf_int(), model_all.pvalues], axis=1)
dummy.columns = ['Params', 'SE', 'T', 'IC_LI', 'IC_LS', 'Pval']
dummy = dummy[dummy['Pval'] < 0.05]
dummy = dummy[(np.sign(dummy['IC_LI']) * np.sign(dummy['IC_LS'])) > 0]
# ', '.join(dummy.index)
dummy['IC'] = (dummy['Params'] - dummy['IC_LI'])  # Intervals are symmetrical
dummy
Out[65]:
Params SE T IC_LI IC_LS Pval IC
Intercept 678.981533 5.606125 121.114236 667.992651 689.970416 0.000000e+00 10.988883
homeowner -14.109587 0.731564 -19.286867 -15.543568 -12.675606 1.102711e-81 1.433981
car_age -0.783868 0.067361 -11.636757 -0.915907 -0.651829 3.899211e-31 0.132039
risk_factor -0.726283 0.229154 -3.169403 -1.175461 -0.277104 1.531264e-03 0.449179
age_oldest 0.564635 0.063942 8.830457 0.439299 0.689971 1.182506e-18 0.125336
age_youngest -0.956600 0.062511 -15.302982 -1.079131 -0.834069 2.212780e-52 0.122531
married_couple -9.702002 1.397397 -6.942912 -12.441119 -6.962886 4.033772e-12 2.739116
C_previous -6.160703 0.359525 -17.135679 -6.865428 -5.455978 4.555148e-65 0.704725
duration_previous -1.482135 0.073336 -20.210234 -1.625885 -1.338385 2.221374e-89 0.143750
state_CO -9.302668 2.563647 -3.628685 -14.327818 -4.277518 2.860214e-04 5.025150
state_CT 30.600011 2.901447 10.546467 24.912721 36.287300 6.799909e-26 5.687289
state_DC 41.295934 5.098839 8.099086 31.301412 51.290457 6.056751e-16 9.994523
state_DE 39.835392 4.605350 8.649808 30.808185 48.862599 5.794042e-18 9.027207
state_FL 12.389904 2.159717 5.736819 8.156520 16.623287 9.874187e-09 4.233383
state_GA 9.060024 2.381354 3.804568 4.392197 13.727851 1.427387e-04 4.667827
state_IA -49.623598 3.487476 -14.229086 -56.459597 -42.787599 1.388219e-45 6.835999
state_ID -17.670971 4.149973 -4.258092 -25.805568 -9.536373 2.077074e-05 8.134598
state_IN -11.151724 2.565983 -4.345985 -16.181453 -6.121995 1.397636e-05 5.029729
state_KY 18.937016 2.821898 6.710738 13.405655 24.468377 2.021237e-11 5.531361
state_MD 23.049719 2.483076 9.282727 18.182501 27.916938 1.925961e-20 4.867218
state_ME -34.424033 4.126903 -8.341373 -42.513409 -26.334658 8.121169e-17 8.089376
state_MO -21.718362 2.931736 -7.408021 -27.465024 -15.971700 1.365545e-13 5.746662
state_ND 19.807475 6.139552 3.226209 7.772992 31.841958 1.257695e-03 12.034483
state_NE -14.068037 5.295046 -2.656830 -24.447157 -3.688918 7.898089e-03 10.379119
state_NH -19.648902 3.727830 -5.270869 -26.956033 -12.341772 1.380733e-07 7.307131
state_NV 22.855437 2.791545 8.187378 17.383572 28.327303 2.931626e-16 5.471866
state_NY 38.405908 2.440907 15.734276 33.621347 43.190469 3.035111e-55 4.784561
state_OH -7.309977 2.312635 -3.160887 -11.843103 -2.776851 1.576704e-03 4.533126
state_OK -10.905339 2.707046 -4.028502 -16.211572 -5.599106 5.647028e-05 5.306233
state_PA 9.394958 2.198447 4.273452 5.085658 13.704259 1.939163e-05 4.309300
state_RI 25.657045 4.233180 6.060939 17.359350 33.954740 1.392839e-09 8.297695
state_TN -10.960155 2.605726 -4.206180 -16.067787 -5.852523 2.615607e-05 5.107632
state_UT -17.989768 2.820125 -6.379067 -23.517656 -12.461881 1.845470e-10 5.527887
state_WI -32.711031 2.968507 -11.019354 -38.529769 -26.892292 4.170180e-28 5.818739
state_WV 25.101955 4.381787 5.728703 16.512968 33.690943 1.035696e-08 8.588988
car_value_b -61.279927 8.237039 -7.439558 -77.425812 -45.134041 1.077222e-13 16.145886
car_value_c -47.104097 5.054941 -9.318426 -57.012573 -37.195620 1.379870e-20 9.908477
car_value_d -42.116633 4.835443 -8.709985 -51.594858 -32.638408 3.424511e-18 9.478225
car_value_e -42.332916 4.810367 -8.800351 -51.761988 -32.903845 1.544462e-18 9.429072
car_value_f -42.171960 4.827756 -8.735312 -51.635118 -32.708801 2.741704e-18 9.463158
car_value_g -37.928569 4.869478 -7.789041 -47.473509 -28.383629 7.291639e-15 9.544940
car_value_h -29.196230 5.064249 -5.765165 -39.122950 -19.269509 8.353539e-09 9.926720
A_1 27.678631 1.576723 17.554532 24.588008 30.769255 3.701649e-68 3.090624
A_2 33.243447 1.866344 17.812072 29.585121 36.901773 4.299254e-70 3.658326
B_1 2.070231 0.790467 2.618997 0.520792 3.619670 8.829673e-03 1.549439
C_4 4.449988 1.746685 2.547676 1.026212 7.873764 1.085634e-02 3.423776
E_1 8.000633 0.870244 9.193548 6.294818 9.706448 4.406035e-20 1.705815
F_1 17.111497 1.716084 9.971247 13.747704 20.475291 2.496333e-23 3.363793
F_2 16.381717 1.651314 9.920412 13.144883 19.618551 4.142729e-23 3.236834
F_3 10.637241 2.356735 4.513550 6.017672 15.256810 6.434215e-06 4.619569
G_2 9.154416 0.936747 9.772557 7.318244 10.990587 1.782221e-22 1.836171
G_4 4.210024 1.266957 3.322942 1.726590 6.693457 8.933285e-04 2.483433
In [66]:
dummy1 = dummy.sort_values('Params').reset_index()
dummy1 = dummy1[dummy1['index'] != 'Intercept']
# sns.color_palette("rocket", as_cmap=True)
# print(dummy1)
fig, axs = plt.subplots(1, 3, sharex=True, figsize = (13,4))

axs[0].errorbar(dummy1.iloc[:17]['Params'], dummy1.iloc[:17]['index'], 
                xerr = dummy1.iloc[:17]['IC'], fmt='o')
fig.suptitle('model_all: parameter estimates (except intercept)', fontsize=16)

axs[0].axvline(color='black', ls='--')
# 
axs[1].errorbar(dummy1.iloc[17:34]['Params'], 
                dummy1.iloc[17:34]['index'], 
                xerr = dummy1.iloc[17:34]['IC'], fmt='o')
axs[1].axvline(color='black', ls='--')
# 
axs[2].errorbar(dummy1.iloc[34:]['Params'], 
                dummy1.iloc[34:]['index'], 
                xerr = dummy1.iloc[34:]['IC'], fmt='o')
axs[2].axvline(color='black', ls='--')
# 
fig.set_tight_layout(True)

The full model (model_all) has several significant variables, and these are: homeowner, car_age, risk_factor, age_oldest, age_youngest, married_couple, C_previous, duration_previous, state (several dummy variables), car_value, A, B, C, E, F, and G. These variables were considered significant because they have a p-value less than 0.05 and confidence intervals that do not cross zero.
Note that some parameters have very small effect sizes, such as car_age, age_oldest, risk_factor, age_youngest, and duration_previous.

In [67]:
d  = model_all.params.reset_index(name='Estimate')

other_vars = (' + '
              .join(
                  d[d['index']
                    .map(lambda x: re.match(r'^state|^car_value|^[ABCEFG]\_\d', x) != None)]
                  ['index']
              ))

model_sig = smf.ols(formula = "cost ~ homeowner + car_age + risk_factor + age_oldest" 
                    " + age_youngest + married_couple + C_previous"
                    " + duration_previous + " + other_vars, 
       data = train).fit()

model_sig.summary()
Out[67]:
OLS Regression Results
Dep. Variable: cost R-squared: 0.435
Model: OLS Adj. R-squared: 0.432
Method: Least Squares F-statistic: 148.3
Date: Tue, 06 Oct 2020 Prob (F-statistic): 0.00
Time: 13:40:24 Log-Likelihood: -61797.
No. Observations: 12386 AIC: 1.237e+05
Df Residuals: 12321 BIC: 1.242e+05
Df Model: 64
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 681.4666 5.401 126.172 0.000 670.880 692.054
homeowner -14.1108 0.732 -19.289 0.000 -15.545 -12.677
car_age -0.7900 0.067 -11.741 0.000 -0.922 -0.658
risk_factor -0.7319 0.229 -3.194 0.001 -1.181 -0.283
age_oldest 0.6421 0.050 12.776 0.000 0.544 0.741
age_youngest -1.0338 0.048 -21.517 0.000 -1.128 -0.940
married_couple -7.5912 0.848 -8.951 0.000 -9.254 -5.929
C_previous -6.1343 0.359 -17.072 0.000 -6.839 -5.430
duration_previous -1.4842 0.073 -20.270 0.000 -1.628 -1.341
state_AR 1.0931 3.120 0.350 0.726 -5.022 7.208
state_CO -9.6644 2.556 -3.781 0.000 -14.675 -4.654
state_CT 29.8744 2.882 10.365 0.000 24.225 35.524
state_DC 40.6764 5.093 7.986 0.000 30.693 50.660
state_DE 39.8379 4.606 8.650 0.000 30.810 48.866
state_FL 11.9712 2.148 5.574 0.000 7.761 16.181
state_GA 8.8888 2.376 3.741 0.000 4.231 13.547
state_IA -50.0426 3.480 -14.379 0.000 -56.864 -43.221
state_ID -17.8641 4.149 -4.305 0.000 -25.998 -9.731
state_IN -11.3230 2.565 -4.415 0.000 -16.351 -6.295
state_KS -7.6266 4.372 -1.744 0.081 -16.197 0.944
state_KY 18.4981 2.816 6.568 0.000 12.978 24.019
state_MD 22.9418 2.482 9.242 0.000 18.076 27.807
state_ME -34.7960 4.122 -8.442 0.000 -42.875 -26.717
state_MO -22.0427 2.926 -7.533 0.000 -27.778 -16.307
state_MS -2.9730 3.278 -0.907 0.365 -9.399 3.453
state_MT -8.6153 5.770 -1.493 0.135 -19.925 2.694
state_ND 19.1713 6.134 3.125 0.002 7.147 31.195
state_NE -14.1545 5.294 -2.673 0.008 -24.532 -3.777
state_NH -20.0379 3.724 -5.380 0.000 -27.338 -12.738
state_NM -1.6658 3.714 -0.449 0.654 -8.946 5.614
state_NV 22.4160 2.784 8.051 0.000 16.959 27.873
state_NY 37.7854 2.428 15.565 0.000 33.027 42.544
state_OH -7.5853 2.307 -3.289 0.001 -12.106 -3.064
state_OK -11.0094 2.707 -4.067 0.000 -16.315 -5.704
state_OR -5.2102 2.807 -1.856 0.063 -10.713 0.292
state_PA 9.1650 2.195 4.175 0.000 4.862 13.468
state_RI 25.5306 4.233 6.031 0.000 17.233 33.828
state_SD -17.7460 14.684 -1.209 0.227 -46.529 11.037
state_TN -10.8741 2.606 -4.173 0.000 -15.981 -5.767
state_UT -18.1394 2.815 -6.444 0.000 -23.657 -12.622
state_WA 3.6444 2.595 1.404 0.160 -1.443 8.732
state_WI -33.1994 2.959 -11.221 0.000 -38.999 -27.400
state_WV 25.1617 4.381 5.744 0.000 16.575 33.749
state_WY -6.2699 6.484 -0.967 0.334 -18.979 6.439
car_value_b -61.2495 8.238 -7.435 0.000 -77.396 -45.103
car_value_c -47.3702 5.052 -9.376 0.000 -57.274 -37.467
car_value_d -42.2855 4.835 -8.747 0.000 -51.762 -32.809
car_value_e -42.4564 4.810 -8.827 0.000 -51.885 -33.028
car_value_f -42.2553 4.827 -8.754 0.000 -51.717 -32.793
car_value_g -38.0863 4.869 -7.822 0.000 -47.631 -28.542
car_value_h -29.3740 5.064 -5.801 0.000 -39.300 -19.448
car_value_i -9.3146 6.467 -1.440 0.150 -21.991 3.362
A_1 27.6944 1.576 17.571 0.000 24.605 30.784
A_2 33.1651 1.866 17.774 0.000 29.507 36.823
B_1 1.9545 0.787 2.483 0.013 0.412 3.497
C_2 -0.1573 1.007 -0.156 0.876 -2.132 1.817
C_3 0.7208 0.970 0.744 0.457 -1.180 2.621
C_4 3.2568 1.594 2.044 0.041 0.133 6.381
E_1 7.9497 0.870 9.137 0.000 6.244 9.655
F_1 16.9080 1.714 9.866 0.000 13.549 20.267
F_2 16.1576 1.648 9.803 0.000 12.927 19.389
F_3 10.5932 2.355 4.499 0.000 5.978 15.209
G_2 9.2298 0.936 9.864 0.000 7.396 11.064
G_3 0.8956 1.188 0.754 0.451 -1.433 3.224
G_4 4.3743 1.264 3.460 0.001 1.896 6.852
Omnibus: 577.432 Durbin-Watson: 1.997
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1417.871
Skew: 0.269 Prob(JB): 1.30e-308
Kurtosis: 4.568 Cond. No. 3.17e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.17e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [78]:
# Predictive Performance Metrics
def MAE(prediction,true_values):
    return np.mean(np.abs(prediction-true_values))

def RMSE(prediction,true_values):
    return np.sqrt(np.mean(np.square(prediction-true_values)))

def MAPE(prediction,true_value):
    return np.mean(np.abs((prediction-true_value)/true_value)*100)
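A quick sanity check of these metric helpers on a tiny example where the errors are +2 and -1 (the definitions are repeated here so the cell stands alone):

```python
import numpy as np

def MAE(prediction, true_values):
    return np.mean(np.abs(prediction - true_values))

def RMSE(prediction, true_values):
    return np.sqrt(np.mean(np.square(prediction - true_values)))

def MAPE(prediction, true_value):
    return np.mean(np.abs((prediction - true_value) / true_value) * 100)

pred = np.array([102.0, 99.0])
true = np.array([100.0, 100.0])

print(MAE(pred, true))   # mean(|+2|, |-1|) = 1.5
print(RMSE(pred, true))  # sqrt((4 + 1) / 2) ≈ 1.58
print(MAPE(pred, true))  # mean(2%, 1%) = 1.5
```

Note that RMSE penalizes the larger error more heavily than MAE, which is why it exceeds MAE whenever the errors are unequal.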
In [79]:
df_Metrics = pd.DataFrame({'Model': ['model_all', 'model_sig']})

df_Metrics['df_model']     = list(map(lambda x: eval(x).df_model, df_Metrics['Model']))
df_Metrics['df_resid']     = list(map(lambda x: eval(x).df_resid, df_Metrics['Model']))
df_Metrics['R2']           = list(map(lambda x: eval(x).rsquared, df_Metrics['Model']))
df_Metrics['R2_adj']       = list(map(lambda x: eval(x).rsquared_adj, df_Metrics['Model']))
df_Metrics['F_pval']       = list(map(lambda x: eval(x).f_pvalue, df_Metrics['Model']))
df_Metrics['AIC']          = list(map(lambda x: eval(x).aic, df_Metrics['Model']))
df_Metrics['BIC']          = list(map(lambda x: eval(x).bic, df_Metrics['Model']))
df_Metrics['Cond_Num']     = list(map(lambda x: eval(x).condition_number, df_Metrics['Model']))
df_Metrics['MAE']          = list(map(lambda x: MAE(eval(x).predict(test), test['cost']), df_Metrics['Model']))
df_Metrics['RMSE']         = list(map(lambda x: RMSE(eval(x).predict(test), test['cost']), df_Metrics['Model']))
df_Metrics['MAPE']         = list(map(lambda x: MAPE(eval(x).predict(test), test['cost']), df_Metrics['Model']))

df_Metrics
Out[79]:
Model df_model df_resid R2 R2_adj F_pval AIC BIC Cond_Num MAE RMSE MAPE
0 model_all 67.0 12318.0 0.435503 0.432432 0.0 123722.407548 124227.261449 3174.191652 27.472577 35.437075 4.356283
1 model_sig 64.0 12321.0 0.435180 0.432246 0.0 123723.491151 124206.072087 3171.314214 27.462750 35.431653 4.354072
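The AIC difference between the two models is small. One way to read it is via the relative likelihood exp((AIC_min - AIC) / 2), which measures how plausible the higher-AIC model is relative to the best one. A sketch using the values from the metrics table above:

```python
import math

# AIC values from the metrics table above
aic_all = 123722.407548
aic_sig = 123723.491151

delta = aic_sig - aic_all
rel_likelihood = math.exp(-delta / 2)
print(round(rel_likelihood, 2))
```

With a relative likelihood near 0.58, model_sig is almost as well supported as model_all despite dropping three terms, and its out-of-sample MAE, RMSE, and MAPE are marginally better, so the simpler model is the reasonable choice.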

Exercise 6 :

In addition to the variables in model_sig, add terms for:

  1. square of age_youngest
  2. square term for the age of the car
  3. interaction term for car_value and age_youngest

and save it to a new model model_sig_plus.

Answer.

In [80]:
model_sig_plus = smf.ols(formula = "cost ~ homeowner + car_age + risk_factor + age_oldest" 
                            " + age_youngest + married_couple + C_previous"
                            " + I(age_youngest**2) + I(car_age**2) "
                            " + car_value_b*age_youngest "
                             " + car_value_c*age_youngest "
                             " + car_value_d*age_youngest "
                             " + car_value_e*age_youngest "
                             " + car_value_f*age_youngest "
                             " + car_value_g*age_youngest "
                             " + car_value_h*age_youngest "
                             " + car_value_i*age_youngest "
                            " + duration_previous + " + other_vars, 
       data = train).fit()

model_sig_plus.summary()
Out[80]:
OLS Regression Results
Dep. Variable: cost R-squared: 0.452
Model: OLS Adj. R-squared: 0.448
Method: Least Squares F-statistic: 137.1
Date: Tue, 06 Oct 2020 Prob (F-statistic): 0.00
Time: 13:47:27 Log-Likelihood: -61613.
No. Observations: 12386 AIC: 1.234e+05
Df Residuals: 12311 BIC: 1.239e+05
Df Model: 74
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 733.4007 12.351 59.381 0.000 709.191 757.610
homeowner -13.1200 0.725 -18.108 0.000 -14.540 -11.700
car_age -1.3484 0.140 -9.655 0.000 -1.622 -1.075
risk_factor -0.6571 0.226 -2.906 0.004 -1.100 -0.214
age_oldest 0.5104 0.050 10.174 0.000 0.412 0.609
age_youngest -3.3977 0.291 -11.684 0.000 -3.968 -2.828
married_couple -7.0646 0.837 -8.436 0.000 -8.706 -5.423
C_previous -6.7749 0.356 -19.006 0.000 -7.474 -6.076
I(age_youngest ** 2) 0.0225 0.001 17.878 0.000 0.020 0.025
I(car_age ** 2) 0.0271 0.006 4.617 0.000 0.016 0.039
car_value_b -107.2589 23.094 -4.644 0.000 -152.528 -61.990
car_value_b:age_youngest 1.1253 0.507 2.217 0.027 0.131 2.120
car_value_c -52.1199 12.620 -4.130 0.000 -76.858 -27.382
car_value_c:age_youngest 0.2831 0.286 0.991 0.322 -0.277 0.843
car_value_d -56.3760 12.035 -4.684 0.000 -79.967 -32.785
car_value_d:age_youngest 0.5341 0.272 1.967 0.049 0.002 1.066
car_value_e -50.2749 11.973 -4.199 0.000 -73.745 -26.805
car_value_e:age_youngest 0.3957 0.270 1.466 0.143 -0.133 0.925
car_value_f -48.5645 12.001 -4.047 0.000 -72.088 -25.041
car_value_f:age_youngest 0.3747 0.270 1.385 0.166 -0.155 0.905
car_value_g -41.5701 12.138 -3.425 0.001 -65.363 -17.777
car_value_g:age_youngest 0.3245 0.273 1.188 0.235 -0.211 0.860
car_value_h -29.9548 12.916 -2.319 0.020 -55.273 -4.637
car_value_h:age_youngest 0.2707 0.285 0.949 0.343 -0.289 0.830
car_value_i 10.7328 20.882 0.514 0.607 -30.198 51.664
car_value_i:age_youngest -0.0801 0.431 -0.186 0.852 -0.925 0.765
duration_previous -1.4815 0.072 -20.494 0.000 -1.623 -1.340
state_AR 0.9598 3.083 0.311 0.756 -5.083 7.003
state_CO -8.0462 2.526 -3.186 0.001 -12.997 -3.096
state_CT 31.8703 2.850 11.182 0.000 26.283 37.457
state_DC 42.7592 5.025 8.510 0.000 32.910 52.609
state_DE 38.0979 4.561 8.354 0.000 29.158 47.037
state_FL 11.0251 2.123 5.194 0.000 6.864 15.186
state_GA 9.7240 2.347 4.144 0.000 5.124 14.324
state_IA -49.3181 3.434 -14.362 0.000 -56.049 -42.587
state_ID -18.1539 4.095 -4.433 0.000 -26.182 -10.126
state_IN -11.3521 2.534 -4.480 0.000 -16.319 -6.385
state_KS -8.3631 4.318 -1.937 0.053 -16.828 0.102
state_KY 20.2506 2.784 7.273 0.000 14.793 25.708
state_MD 24.5988 2.457 10.012 0.000 19.783 29.415
state_ME -32.9909 4.074 -8.097 0.000 -40.977 -25.005
state_MO -22.4998 2.890 -7.785 0.000 -28.165 -16.835
state_MS -1.9399 3.237 -0.599 0.549 -8.286 4.406
state_MT -10.5961 5.691 -1.862 0.063 -21.751 0.559
state_ND 18.6754 6.054 3.085 0.002 6.808 30.542
state_NE -14.7687 5.224 -2.827 0.005 -25.009 -4.528
state_NH -20.4444 3.692 -5.537 0.000 -27.682 -13.207
state_NM -2.5308 3.672 -0.689 0.491 -9.729 4.667
state_NV 23.2107 2.750 8.440 0.000 17.820 28.601
state_NY 39.2659 2.403 16.343 0.000 34.556 43.975
state_OH -7.4839 2.277 -3.287 0.001 -11.947 -3.021
state_OK -11.8672 2.675 -4.437 0.000 -17.110 -6.624
state_OR -4.6051 2.774 -1.660 0.097 -10.042 0.832
state_PA 10.0657 2.170 4.638 0.000 5.811 14.320
state_RI 23.0669 4.180 5.519 0.000 14.874 31.260
state_SD -11.1599 14.479 -0.771 0.441 -39.541 17.222
state_TN -10.5800 2.574 -4.111 0.000 -15.625 -5.535
state_UT -19.2826 2.785 -6.925 0.000 -24.741 -13.824
state_WA 3.9403 2.572 1.532 0.126 -1.101 8.982
state_WI -30.6412 2.924 -10.479 0.000 -36.373 -24.909
state_WV 27.0906 4.322 6.268 0.000 18.619 35.563
state_WY -5.0869 6.396 -0.795 0.426 -17.624 7.450
A_1 27.6656 1.554 17.803 0.000 24.619 30.712
A_2 32.8168 1.840 17.833 0.000 29.210 36.424
B_1 1.7847 0.777 2.297 0.022 0.262 3.307
C_2 -0.6112 0.994 -0.615 0.539 -2.559 1.337
C_3 0.2584 0.958 0.270 0.787 -1.619 2.136
C_4 1.7734 1.575 1.126 0.260 -1.314 4.861
E_1 8.6239 0.860 10.032 0.000 6.939 10.309
F_1 16.6872 1.691 9.870 0.000 13.373 20.001
F_2 16.3530 1.626 10.058 0.000 13.166 19.540
F_3 11.5693 2.322 4.982 0.000 7.018 16.121
G_2 9.0894 0.924 9.839 0.000 7.279 10.900
G_3 1.0733 1.173 0.915 0.360 -1.226 3.373
G_4 4.6908 1.248 3.760 0.000 2.245 7.136
Omnibus: 612.029 Durbin-Watson: 1.996
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1735.838
Skew: 0.234 Prob(JB): 0.00
Kurtosis: 4.773 Cond. No. 3.22e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.22e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [81]:
df_Metrics = pd.DataFrame({'Model': ['model_all', 'model_sig', 'model_sig_plus']})

df_Metrics['df_model']     = list(map(lambda x: eval(x).df_model, df_Metrics['Model']))
df_Metrics['df_resid']     = list(map(lambda x: eval(x).df_resid, df_Metrics['Model']))
df_Metrics['R2']           = list(map(lambda x: eval(x).rsquared, df_Metrics['Model']))
df_Metrics['R2_adj']       = list(map(lambda x: eval(x).rsquared_adj, df_Metrics['Model']))
df_Metrics['F_pval']       = list(map(lambda x: eval(x).f_pvalue, df_Metrics['Model']))
df_Metrics['AIC']          = list(map(lambda x: eval(x).aic, df_Metrics['Model']))
df_Metrics['BIC']          = list(map(lambda x: eval(x).bic, df_Metrics['Model']))
df_Metrics['Cond_Num']     = list(map(lambda x: eval(x).condition_number, df_Metrics['Model']))
df_Metrics['MAE']          = list(map(lambda x: MAE(eval(x).predict(test), test['cost']), df_Metrics['Model']))
df_Metrics['RMSE']         = list(map(lambda x: RMSE(eval(x).predict(test), test['cost']), df_Metrics['Model']))
df_Metrics['MAPE']         = list(map(lambda x: MAPE(eval(x).predict(test), test['cost']), df_Metrics['Model']))

df_Metrics
Out[81]:
Model df_model df_resid R2 R2_adj F_pval AIC BIC Cond_Num MAE RMSE MAPE
0 model_all 67.0 12318.0 0.435503 0.432432 0.0 123722.407548 124227.261449 3174.191652 27.472577 35.437075 4.356283
1 model_sig 64.0 12321.0 0.435180 0.432246 0.0 123723.491151 124206.072087 3171.314214 27.462750 35.431653 4.354072
2 model_sig_plus 74.0 12311.0 0.451707 0.448412 0.0 123375.641999 123932.466155 321637.206271 27.151956 35.173398 4.303648

According to both the goodness-of-fit measures (R-squared, AIC, BIC) and the out-of-sample error metrics (MAE, RMSE, MAPE), the best model so far is model_sig_plus, although its condition number is roughly 100 times larger than that of the other two models.
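For reference, the MAE, RMSE, and MAPE helpers used in the comparison above were defined in an earlier case; a minimal sketch, assuming the standard definitions (with MAPE expressed in percent, which matches the ~4.3% values in the table):

```python
import numpy as np

def MAE(pred, obs):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(pred) - np.asarray(obs)))

def RMSE(pred, obs):
    """Root mean squared error."""
    return np.sqrt(np.mean((np.asarray(pred) - np.asarray(obs)) ** 2))

def MAPE(pred, obs):
    """Mean absolute percentage error, in percent."""
    obs = np.asarray(obs, dtype=float)
    return np.mean(np.abs((np.asarray(pred) - obs) / obs)) * 100
```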

Feature selection

To reduce the number of features, it can often be helpful to aggregate the categories; for example, we can create a new variable by assigning each state to a larger region:

In [87]:
state_regions = pd.read_csv('./state_regions.csv')
# state_regions.csv must be downloaded separately into the working directory
state_regions.head()
Out[87]:
State State Code Region Division
0 Alaska AK West Pacific
1 Alabama AL South East South Central
2 Arkansas AR South West South Central
3 Arizona AZ West Mountain
4 California CA West Pacific

Exercise 7 :

7.1

Create a new column where a state is replaced with the region it is in according to the above table.

Answer.

In [83]:
df_cat2 = df.merge(state_regions, how='left', left_on='state', right_on='State Code')

# One-hot encode every categorical column except the state identifiers
cat_variables = df_cat2.select_dtypes(exclude=['number', 'bool']).columns
cat_variables = [i for i in cat_variables if i not in ['state', 'State', 'State Code']]

df_cat2 = pd.get_dummies(df_cat2, columns=cat_variables, drop_first=True)
df_cat2.columns
Out[83]:
Index(['state', 'group_size', 'homeowner', 'car_age', 'risk_factor',
       'age_oldest', 'age_youngest', 'married_couple', 'C_previous',
       'duration_previous', 'cost', 'State', 'State Code', 'car_value_b',
       'car_value_c', 'car_value_d', 'car_value_e', 'car_value_f',
       'car_value_g', 'car_value_h', 'car_value_i', 'A_1', 'A_2', 'B_1', 'C_2',
       'C_3', 'C_4', 'D_2', 'D_3', 'E_1', 'F_1', 'F_2', 'F_3', 'G_2', 'G_3',
       'G_4', 'Region_Northeast', 'Region_South', 'Region_West',
       'Division_East South Central', 'Division_Middle Atlantic',
       'Division_Mountain', 'Division_New England', 'Division_Pacific',
       'Division_South Atlantic', 'Division_West North Central',
       'Division_West South Central'],
      dtype='object')

7.2

Fit the model as in model_sig_plus but this time use region instead of state. Call this model_region.

Answer.

In [85]:
d = pd.Series(df_cat2.columns)[df_cat2.columns.map(lambda x: re.match(r'^car_value|^[ABCEFG]_\d', x) is not None)]
d = list(d)
d.extend(['Region_Northeast', 'Region_South', 'Region_West'])
other_vars = ' + '.join(d)
In [92]:
train, test2 = train_test_split(df_cat2, test_size=0.2, random_state=1337)

model_region = smf.ols(formula = "cost ~ homeowner + car_age + risk_factor + age_oldest" 
                            " + age_youngest + married_couple + C_previous"
                            " + I(age_youngest**2) + I(car_age**2) "
                            " + car_value_b*age_youngest "
                             " + car_value_c*age_youngest "
                             " + car_value_d*age_youngest "
                             " + car_value_e*age_youngest "
                             " + car_value_f*age_youngest "
                             " + car_value_g*age_youngest "
                             " + car_value_h*age_youngest "
                             " + car_value_i*age_youngest "
                            " + duration_previous + " + other_vars, 
       data = train).fit()

model_region.summary()
Out[92]:
OLS Regression Results
Dep. Variable: cost R-squared: 0.380
Model: OLS Adj. R-squared: 0.378
Method: Least Squares F-statistic: 180.0
Date: Tue, 06 Oct 2020 Prob (F-statistic): 0.00
Time: 13:58:21 Log-Likelihood: -62375.
No. Observations: 12386 AIC: 1.248e+05
Df Residuals: 12343 BIC: 1.252e+05
Df Model: 42
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 706.8715 12.833 55.081 0.000 681.716 732.027
homeowner -13.9617 0.758 -18.431 0.000 -15.447 -12.477
car_age -1.5214 0.147 -10.342 0.000 -1.810 -1.233
risk_factor 0.3819 0.225 1.699 0.089 -0.059 0.823
age_oldest 0.5300 0.053 10.000 0.000 0.426 0.634
age_youngest -3.1114 0.307 -10.142 0.000 -3.713 -2.510
married_couple -7.5110 0.884 -8.495 0.000 -9.244 -5.778
C_previous -7.2932 0.374 -19.496 0.000 -8.026 -6.560
I(age_youngest ** 2) 0.0201 0.001 15.135 0.000 0.017 0.023
I(car_age ** 2) 0.0363 0.006 5.879 0.000 0.024 0.048
car_value_b -99.5566 24.384 -4.083 0.000 -147.354 -51.759
car_value_b:age_youngest 1.0899 0.536 2.032 0.042 0.039 2.141
car_value_c -44.2680 13.328 -3.321 0.001 -70.393 -18.143
car_value_c:age_youngest 0.2372 0.302 0.785 0.433 -0.355 0.830
car_value_d -52.4174 12.700 -4.127 0.000 -77.312 -27.523
car_value_d:age_youngest 0.5539 0.287 1.930 0.054 -0.009 1.117
car_value_e -43.2489 12.619 -3.427 0.001 -67.984 -18.514
car_value_e:age_youngest 0.3477 0.285 1.219 0.223 -0.211 0.907
car_value_f -40.5115 12.649 -3.203 0.001 -65.306 -15.717
car_value_f:age_youngest 0.3094 0.286 1.082 0.279 -0.251 0.870
car_value_g -34.8712 12.787 -2.727 0.006 -59.937 -9.806
car_value_g:age_youngest 0.2775 0.289 0.962 0.336 -0.288 0.843
car_value_h -18.0250 13.601 -1.325 0.185 -44.684 8.634
car_value_h:age_youngest 0.0991 0.301 0.329 0.742 -0.492 0.690
car_value_i 16.7005 22.114 0.755 0.450 -26.646 60.047
car_value_i:age_youngest -0.0938 0.457 -0.205 0.837 -0.989 0.801
duration_previous -1.4280 0.076 -18.752 0.000 -1.577 -1.279
A_1 40.1570 1.412 28.438 0.000 37.389 42.925
A_2 46.1512 1.727 26.726 0.000 42.766 49.536
B_1 1.5940 0.807 1.975 0.048 0.012 3.176
C_2 -0.2220 0.994 -0.223 0.823 -2.170 1.726
C_3 1.2744 0.972 1.311 0.190 -0.631 3.180
C_4 3.4582 1.627 2.125 0.034 0.269 6.648
E_1 11.4106 0.898 12.705 0.000 9.650 13.171
F_1 -4.1112 1.325 -3.102 0.002 -6.709 -1.513
F_2 -3.0886 1.289 -2.396 0.017 -5.615 -0.562
F_3 -8.1684 2.172 -3.761 0.000 -12.426 -3.911
G_2 11.2242 0.922 12.169 0.000 9.416 13.032
G_3 6.9335 1.054 6.576 0.000 4.867 9.000
G_4 7.8528 1.264 6.214 0.000 5.376 10.330
Region_Northeast 31.5091 1.228 25.652 0.000 29.101 33.917
Region_South 23.3643 1.030 22.675 0.000 21.345 25.384
Region_West 14.0022 1.192 11.750 0.000 11.666 16.338
Omnibus: 434.077 Durbin-Watson: 1.984
Prob(Omnibus): 0.000 Jarque-Bera (JB): 827.410
Skew: 0.268 Prob(JB): 2.14e-180
Kurtosis: 4.147 Cond. No. 3.19e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.19e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Exercise 8 :

8.1

What could we do next to further reduce the number of features?

Answer.


To decrease the number of variables further, we could try to replace the car_value categories with a single numeric equivalent, so that the whole variable is captured by one parameter. However, this option does not seem viable: car_value appears to be a determining factor in the cost of insurance, yet it is qualitative by nature, so forcing it onto a numeric scale risks discarding information.
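For illustration only, one common way to test that idea is target (mean) encoding, which collapses the car_value dummies into a single numeric column. The toy data below is made up, and in practice the group means must be computed on the training split only, to avoid leakage:

```python
import pandas as pd

# Hypothetical toy data: car_value categories with associated costs
toy = pd.DataFrame({'car_value': ['a', 'b', 'a', 'c', 'b', 'a'],
                    'cost':      [600, 650, 620, 700, 640, 610]})

# Replace each category with the mean cost observed for that category
means = toy.groupby('car_value')['cost'].mean()
toy['car_value_enc'] = toy['car_value'].map(means)
print(toy[['car_value', 'car_value_enc']])
```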

8.2

Using a method of your choice, find the numerical feature(s) in model_region, except for the three we added in Exercise 6, which exhibit multicollinearity. Hint: consider looking at correlations.

Answer.

In [26]:
M_df = pd.DataFrame(data    = model_region.model.data.exog,
                    columns = model_region.model.data.param_names)

M_df.drop(columns = ['I(age_youngest ** 2)', 'I(car_age ** 2)',
           'car_value_b:age_youngest', 'car_value_c:age_youngest',
           'car_value_d:age_youngest', 'car_value_e:age_youngest',
           'car_value_f:age_youngest', 'car_value_g:age_youngest',
           'car_value_h:age_youngest', 'car_value_i:age_youngest', 'Intercept'], inplace=True)

plt.figure(figsize=(15,13))
sns.heatmap(M_df.corr(), cmap="RdYlBu", 
    annot=True, square=True,
    vmin=-0.8, vmax=0.8, fmt="+.1f",
    annot_kws={'fontsize':'x-small'})

plt.title("model_region: correlations between predictors \n",
          fontdict={'fontsize':20, 'verticalalignment':'bottom'});
In [27]:
corr_df = pd.melt(M_df.corr().reset_index(), id_vars='index')\
.rename(columns={'index':'var1', 'variable':'var2'})\
.dropna()\
.query('value != 1')\
.sort_values('value')

Below, a non-parametric bootstrap is used to find a confidence interval for the correlation between each pair of variables and determine whether that correlation differs significantly from zero.

In [ ]:
corr_df['dat1'] = [M_df.loc[:, [x, y]] for x, y in zip(corr_df['var1'], corr_df['var2'])]

from sklearn.utils import resample

# run bootstrap
def customBootstrap(data, samples, alpha=0.05):
    size = data.shape[0]
    B_stats = list()

    for i in range(samples):
        train = resample(list(range(size)), n_samples=size)  # Sample indices with replacement
        train_data = data.iloc[train, ]                      # Select sampled rows
        stat = train_data.corr().iloc[0, 1]                  # Correlation of the two columns
        B_stats.append(stat)                                 # Append to list

    upper = np.percentile(B_stats, (1.0 - alpha/2.0) * 100)
    lower = np.percentile(B_stats, (alpha/2.0) * 100)
    return (lower, upper)

corr_df['B_CI'] = corr_df['dat1'].map(lambda x: customBootstrap(x, 100))
In [ ]:
corr_df['LI'] = corr_df['B_CI'].map(lambda x: round(x[0],3))
corr_df['LS'] = corr_df['B_CI'].map(lambda x: round(x[1],3))
corr_df.drop(columns = ['dat1', 'B_CI'], inplace=True)
corr_df.reset_index(drop=True, inplace=True)
In [28]:
# corr_df.to_csv('corr_data.csv')
# corr_df = pd.read_csv('./corr_data.csv', index_col=0)
In [29]:
# Correlation: (752/992) 75.8% of the variable pairs have a significant correlation
corr_df = corr_df[(np.sign(corr_df.LI) * np.sign(corr_df.LS)) > 0] 
corr_df
## Summary of variables with highest correlation
print(pd.concat([corr_df.head(10).reset_index(drop=True), 
                 corr_df.tail(10).reset_index(drop=True)], axis=1, ignore_index=False))
               var1              var2     value     LI     LS          var1  \
0               A_2               A_1 -0.511740 -0.501 -0.523     homeowner   
1               A_1               A_2 -0.511740 -0.502 -0.523    age_oldest   
2               G_3               G_2 -0.502244 -0.492 -0.513           F_2   
3               G_2               G_3 -0.502244 -0.494 -0.514           A_1   
4      Region_South  Region_Northeast -0.495516 -0.488 -0.506           B_1   
5  Region_Northeast      Region_South -0.495516 -0.487 -0.503           E_1   
6       car_value_e       car_value_f -0.436237 -0.427 -0.443           A_2   
7       car_value_f       car_value_e -0.436237 -0.429 -0.447           F_3   
8               C_2               C_3 -0.424528 -0.415 -0.432    age_oldest   
9               C_3               C_2 -0.424528 -0.414 -0.434  age_youngest   

           var2     value     LI     LS  
0    age_oldest  0.408401  0.425  0.391  
1     homeowner  0.408401  0.423  0.395  
2           A_1  0.411243  0.429  0.396  
3           F_2  0.411243  0.425  0.396  
4           E_1  0.509773  0.526  0.495  
5           B_1  0.509773  0.525  0.500  
6           F_3  0.580115  0.600  0.557  
7           A_2  0.580115  0.601  0.556  
8  age_youngest  0.918154  0.923  0.913  
9    age_oldest  0.918154  0.924  0.913  

8.3

Refit model_region after dropping these redundant predictor(s); call this model_region_no_oldest.

Answer.

The variables age_oldest, A_1, A_2, B_1, C_2, C_3, C_4, E_1, F_1, F_2, F_3, G_2, G_3, G_4 show high correlation with other variables; removing them could reduce the condition number.
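A complementary diagnostic to the pairwise correlations is the variance inflation factor from statsmodels, which measures how much each coefficient's variance is inflated by collinearity; values well above 5-10 flag problem predictors. A sketch on synthetic data (the column names only mimic the real ones):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 500
age_oldest = rng.normal(45, 12, n)
# age_youngest nearly duplicates age_oldest, mimicking their 0.92 correlation above
age_youngest = age_oldest - np.abs(rng.normal(3, 2, n))
car_age = rng.normal(8, 3, n)

X = pd.DataFrame({'const': 1.0,
                  'age_oldest': age_oldest,
                  'age_youngest': age_youngest,
                  'car_age': car_age})

# VIF of column i = 1 / (1 - R^2) from regressing it on the other columns
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif)
```

The two age columns come out with far larger VIFs than car_age, matching the intuition that one of them is redundant.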

In [98]:
df_cat2 = df.merge(state_regions, how='left', left_on='state', right_on='State Code')

train, test3 = train_test_split(df_cat2, test_size=0.2, random_state=1337)
In [99]:
model_region_no_oldest = smf.ols(formula = "cost ~ homeowner + car_age + risk_factor " 
                            " + married_couple + C_previous"
                            " + I(age_youngest**2) + I(car_age**2) "
                            " + car_value * age_youngest "
                            " + duration_previous ", 
       data = train).fit()

model_region_no_oldest.summary()
Out[99]:
OLS Regression Results
Dep. Variable: cost R-squared: 0.184
Model: OLS Adj. R-squared: 0.182
Method: Least Squares F-statistic: 110.9
Date: Tue, 06 Oct 2020 Prob (F-statistic): 0.00
Time: 14:01:29 Log-Likelihood: -63865.
No. Observations: 12348 AIC: 1.278e+05
Df Residuals: 12322 BIC: 1.280e+05
Df Model: 25
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 832.3881 27.106 30.708 0.000 779.255 885.521
car_value[T.b] -142.1041 35.184 -4.039 0.000 -211.071 -73.138
car_value[T.c] -101.9875 27.382 -3.725 0.000 -155.660 -48.315
car_value[T.d] -119.2745 27.244 -4.378 0.000 -172.677 -65.872
car_value[T.e] -107.8027 27.198 -3.964 0.000 -161.114 -54.491
car_value[T.f] -98.3252 27.229 -3.611 0.000 -151.698 -44.953
car_value[T.g] -97.2428 27.329 -3.558 0.000 -150.811 -43.674
car_value[T.h] -75.1792 27.853 -2.699 0.007 -129.775 -20.583
car_value[T.i] -52.8299 34.120 -1.548 0.122 -119.711 14.052
homeowner -11.9219 0.849 -14.036 0.000 -13.587 -10.257
car_age -2.9453 0.185 -15.954 0.000 -3.307 -2.583
risk_factor 0.7656 0.257 2.984 0.003 0.263 1.269
married_couple -4.2204 0.976 -4.324 0.000 -6.134 -2.307
C_previous -4.9563 0.360 -13.761 0.000 -5.662 -4.250
I(age_youngest ** 2) 0.0188 0.001 12.628 0.000 0.016 0.022
I(car_age ** 2) 0.0244 0.008 2.896 0.004 0.008 0.041
age_youngest -2.9092 0.484 -6.007 0.000 -3.858 -1.960
car_value[T.b]:age_youngest 1.4417 0.696 2.071 0.038 0.077 2.806
car_value[T.c]:age_youngest 0.6264 0.479 1.308 0.191 -0.312 1.565
car_value[T.d]:age_youngest 1.0588 0.468 2.265 0.024 0.142 1.975
car_value[T.e]:age_youngest 0.8455 0.466 1.815 0.070 -0.068 1.759
car_value[T.f]:age_youngest 0.6844 0.466 1.467 0.142 -0.230 1.599
car_value[T.g]:age_youngest 0.7563 0.469 1.613 0.107 -0.163 1.675
car_value[T.h]:age_youngest 0.4949 0.480 1.032 0.302 -0.445 1.435
car_value[T.i]:age_youngest 0.6213 0.618 1.005 0.315 -0.590 1.833
duration_previous -0.9430 0.086 -11.024 0.000 -1.111 -0.775
Omnibus: 160.713 Durbin-Watson: 1.990
Prob(Omnibus): 0.000 Jarque-Bera (JB): 273.156
Skew: 0.087 Prob(JB): 4.84e-60
Kurtosis: 3.708 Cond. No. 5.83e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.83e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

The model still shows substantial collinearity problems; these could be mitigated by removing further variables (such as those added in Exercise 6).
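Part of the large condition number simply reflects predictors on very different scales (e.g. I(age_youngest**2) sitting next to 0/1 dummies). Standardizing the numeric columns typically shrinks it by orders of magnitude, as this synthetic sketch illustrates:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
age = rng.uniform(18, 75, n)
dummy = rng.integers(0, 2, n).astype(float)

# Raw design matrix: intercept, age, age squared, and a 0/1 dummy
X_raw = np.column_stack([np.ones(n), age, age ** 2, dummy])

# Standardize the non-constant columns to mean 0, sd 1
X_std = X_raw.copy()
X_std[:, 1:] = (X_raw[:, 1:] - X_raw[:, 1:].mean(axis=0)) / X_raw[:, 1:].std(axis=0)

print(np.linalg.cond(X_raw), np.linalg.cond(X_std))
```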

8.4

What would you do to diagnose the model_region_no_oldest fit? What does this diagnosis suggest to you? (Hint: try plotting the residuals in various ways.)

Answer.

In [33]:
def diagnosticPlot(model):
    res1 = model.resid
    sca1 = model.scale
    yobs = model.model.data.endog

    rangeResid = np.linspace(res1.min(), res1.max(), num=1000)

    fig, axs = plt.subplots(2, 2, figsize = (14,8))

    axs[0][0].hist(res1, density=True, bins=100, label="residuals")

    axs[0][0].plot(rangeResid, scipy.stats.norm.pdf(rangeResid, loc=0.0, scale=np.sqrt(sca1)),
                      label="normal distribution", ls='--')
    # Flag residuals more than four residual SDs from zero as outliers
    outliers = np.abs(res1) > 4*np.sqrt(sca1)
    sns.rugplot(res1[outliers], color="g", label="outliers", ax=axs[0][0])
    axs[0][0].legend(loc="upper right")
    axs[0][0].set(title='Residual Distribution Plot')

    sm.qqplot(res1, line="s", marker='.', alpha=0.5, ax=axs[0][1]);
    axs[0][1].set(title='QQ Plot Residuals')


    axs[1][0].plot(yobs, res1, ".", alpha=0.2)
    axs[1][0].set(ylim = (-300, +300), xlabel="Cost", ylabel = "Residuals", title='Residuals vs DV')
    axs[1][0].axhline(0, color='black', ls='--');

    sm.graphics.plot_leverage_resid2(model, ax = axs[1][1])
    fig.set_tight_layout(True);
In [34]:
diagnosticPlot(model_region_no_oldest)

Exercise 9 :

9.1

Find the best Box-Cox transformation of cost on the data used to fit model_region_no_oldest. What value of lambda do you get?

Answer.

In [39]:
cost, fitted_lambda = stats.boxcox(train['cost'])
round(fitted_lambda,2)
Out[39]:
0.49
In [89]:
sns.histplot(cost, kde=True).set(xlabel='Auto insurance cost');

The fitted lambda of 0.49 is essentially 0.5, i.e. a square-root transformation; the transformed distribution looks visibly closer to normal than the raw costs.
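As a sanity check, scipy's stats.boxcox jointly fits lambda by maximizing the normal log-likelihood and returns (y**lambda - 1)/lambda, the same expression used in the refit in 9.2. Its effect can be reproduced on synthetic right-skewed data (made-up gamma "costs"; the fitted lambda here will differ from 0.49):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.gamma(shape=2.0, scale=300.0, size=5000)  # synthetic right-skewed "costs"

# boxcox returns the transformed data and the lambda that maximizes normality
transformed, lam = stats.boxcox(sample)

# Skewness should shrink markedly after the transformation
print(round(stats.skew(sample), 2), round(stats.skew(transformed), 2))
```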

9.2

Refit model_region_no_oldest, but now with the transformation as suggested by the Box-Cox. Call it model_region_no_oldest_box_cox.

Answer.

In [45]:
model_region_no_oldest_bc = smf.ols(formula = "I(((cost**0.49)-1)/0.49) ~ homeowner + car_age + risk_factor " 
                            " + married_couple + C_previous"
                            " + I(age_youngest**2) + I(car_age**2) "
                            " + car_value * age_youngest "
                            " + duration_previous ", 
       data = train).fit()

model_region_no_oldest_bc.summary()
Out[45]:
OLS Regression Results
Dep. Variable: I(((cost ** 0.49) - 1) / 0.49) R-squared: 0.183
Model: OLS Adj. R-squared: 0.182
Method: Least Squares F-statistic: 110.6
Date: Tue, 06 Oct 2020 Prob (F-statistic): 0.00
Time: 13:23:48 Log-Likelihood: -23238.
No. Observations: 12348 AIC: 4.653e+04
Df Residuals: 12322 BIC: 4.672e+04
Df Model: 25
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 53.6295 1.010 53.117 0.000 51.650 55.609
car_value[T.b] -5.5617 1.311 -4.244 0.000 -8.131 -2.993
car_value[T.c] -3.9655 1.020 -3.888 0.000 -5.965 -1.966
car_value[T.d] -4.6038 1.015 -4.537 0.000 -6.593 -2.615
car_value[T.e] -4.1848 1.013 -4.131 0.000 -6.171 -2.199
car_value[T.f] -3.8360 1.014 -3.782 0.000 -5.824 -1.848
car_value[T.g] -3.7992 1.018 -3.732 0.000 -5.795 -1.804
car_value[T.h] -2.9803 1.037 -2.873 0.004 -5.014 -0.947
car_value[T.i] -2.2698 1.271 -1.786 0.074 -4.761 0.221
homeowner -0.4441 0.032 -14.037 0.000 -0.506 -0.382
car_age -0.1083 0.007 -15.747 0.000 -0.122 -0.095
risk_factor 0.0267 0.010 2.794 0.005 0.008 0.045
married_couple -0.1608 0.036 -4.423 0.000 -0.232 -0.090
C_previous -0.1830 0.013 -13.643 0.000 -0.209 -0.157
I(age_youngest ** 2) 0.0007 5.55e-05 12.430 0.000 0.001 0.001
I(car_age ** 2) 0.0008 0.000 2.570 0.010 0.000 0.001
age_youngest -0.1084 0.018 -6.011 0.000 -0.144 -0.073
car_value[T.b]:age_youngest 0.0563 0.026 2.172 0.030 0.006 0.107
car_value[T.c]:age_youngest 0.0246 0.018 1.378 0.168 -0.010 0.060
car_value[T.d]:age_youngest 0.0405 0.017 2.326 0.020 0.006 0.075
car_value[T.e]:age_youngest 0.0327 0.017 1.885 0.059 -0.001 0.067
car_value[T.f]:age_youngest 0.0268 0.017 1.544 0.123 -0.007 0.061
car_value[T.g]:age_youngest 0.0295 0.017 1.689 0.091 -0.005 0.064
car_value[T.h]:age_youngest 0.0198 0.018 1.109 0.267 -0.015 0.055
car_value[T.i]:age_youngest 0.0266 0.023 1.157 0.247 -0.018 0.072
duration_previous -0.0345 0.003 -10.819 0.000 -0.041 -0.028
Omnibus: 203.056 Durbin-Watson: 1.989
Prob(Omnibus): 0.000 Jarque-Bera (JB): 395.073
Skew: -0.054 Prob(JB): 1.63e-86
Kurtosis: 3.869 Cond. No. 5.83e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.83e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
In [46]:
diagnosticPlot(model_region_no_oldest_bc)

We observe that the transformation markedly improves the properties of the residuals, reducing both the asymmetry of the empirical distribution and the heteroscedasticity.
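One practical caveat: model_region_no_oldest_bc predicts on the transformed scale, so its predictions must be inverted before they can be compared with the earlier models' MAE or RMSE in dollars. With lambda = 0.49 the inverse is (lambda*z + 1)**(1/lambda), which scipy provides as inv_boxcox (a sketch with a hypothetical cost value):

```python
from scipy.special import inv_boxcox

lam = 0.49
y = 620.0                      # a hypothetical cost in dollars
z = (y ** lam - 1) / lam       # forward Box-Cox, as in the model formula
y_back = inv_boxcox(z, lam)    # inverse: (lam*z + 1)**(1/lam)
print(round(y_back, 6))
```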

In [103]:
df_Metrics = pd.DataFrame({'Model': ['model_all', 'model_sig', 'model_sig_plus', 'model_region', 'model_region_no_oldest',
                                     'model_region_no_oldest_bc'],
                           'Test': ['test', 'test', 'test', 'test2', 'test3', 'test3']})

df_Metrics['df_model']     = list(map(lambda x: eval(x).df_model, df_Metrics['Model']))
df_Metrics['df_resid']     = list(map(lambda x: eval(x).df_resid, df_Metrics['Model']))
df_Metrics['R2']           = list(map(lambda x: eval(x).rsquared, df_Metrics['Model']))
df_Metrics['R2_adj']       = list(map(lambda x: eval(x).rsquared_adj, df_Metrics['Model']))
df_Metrics['F_pval']       = list(map(lambda x: eval(x).f_pvalue, df_Metrics['Model']))
df_Metrics['AIC']          = list(map(lambda x: eval(x).aic, df_Metrics['Model']))
df_Metrics['BIC']          = list(map(lambda x: eval(x).bic, df_Metrics['Model']))
df_Metrics['Cond_Num']     = list(map(lambda x: eval(x).condition_number, df_Metrics['Model']))

df_Metrics
Out[103]:
Model Test df_model df_resid R2 R2_adj F_pval AIC BIC Cond_Num
0 model_all test 67.0 12318.0 0.435503 0.432432 0.0 123722.407548 124227.261449 3174.191652
1 model_sig test 64.0 12321.0 0.435180 0.432246 0.0 123723.491151 124206.072087 3171.314214
2 model_sig_plus test 74.0 12311.0 0.451707 0.448412 0.0 123375.641999 123932.466155 321637.206271
3 model_region test2 42.0 12343.0 0.379886 0.377776 0.0 124836.295130 125155.540980 318852.748516
4 model_region_no_oldest test3 25.0 12322.0 0.183738 0.182082 0.0 127781.312848 127974.265332 582561.135557
5 model_region_no_oldest_bc test3 25.0 12322.0 0.183209 0.181552 0.0 46527.453124 46720.405609 582561.135557

Conclusion

In this case, you practiced creating linear models with statsmodels and iteratively trimming the input variables, going from a model that included every variable in the dataset to one that uses only the most relevant ones. You excluded variables that were statistically insignificant and removed those with high correlation. Finally, we performed some feature engineering (a Box-Cox transformation of the response) to tame tail behavior that deviated from the normal distribution and better satisfy the linear model's assumptions. In the end, we had a very minimal model containing the variables that other insurance companies use to charge premiums, giving us insight into how we can better serve a niche population.